| Name | Version | Summary | Date |
|------|---------|---------|------|
| tool-scorer | 1.3.3 | Catch LLM agent regressions before deployment. Test tool-calling accuracy for OpenAI, Anthropic, Gemini with pytest integration and CI/CD workflows. | 2025-10-28 21:46:55 |
| generic-llm-api-client | 0.1.2 | A unified, provider-agnostic Python client for multiple LLM APIs | 2025-10-28 13:32:06 |
| bbob-jax | 0.5.0 | BBOB benchmark functions implemented in JAX | 2025-10-27 11:15:39 |
| novaeval | 0.6.1 | A comprehensive, open-source LLM evaluation framework for testing and benchmarking AI models | 2025-10-27 11:10:57 |
| lmur | 2.1.1 | Neural Network Dataset | 2025-10-27 09:53:42 |
| nn-dataset | 2.1.1 | Neural Network Dataset | 2025-10-27 09:45:16 |
| ariadne-router | 0.4.0 | Intelligent quantum simulator router with automatic backend selection | 2025-10-26 22:11:56 |
| LevDoom | 1.0.3 | LevDoom: A Generalization Benchmark for Deep Reinforcement Learning | 2025-10-25 17:16:51 |
| raven-pyu | 1.0.2 | Utilities for Python | 2025-10-20 11:23:33 |
| insdc-benchmarking-schema | 1.2.0 | JSON schema and validation for INSDC benchmarking results | 2025-10-16 11:37:57 |
| lrdbenchmark | 2.2.0 | Comprehensive long-range dependence benchmarking framework with classical, ML, and neural network estimators, plus 5 demonstration notebooks | 2025-10-14 09:26:25 |
| LLMEvaluationFramework | 0.0.21 | Enterprise-grade Python framework for large language model evaluation and testing | 2025-10-12 08:37:49 |
| guidellm | 0.3.1 | Guidance platform for deploying and managing large language models | 2025-10-10 13:40:23 |
| mcpuniverse | 1.0.3 | A framework for developing and benchmarking AI agents using the Model Context Protocol (MCP) | 2025-10-07 08:23:22 |
| mlbench-lite | 2.0.3 | A simple machine learning benchmarking library | 2025-09-18 20:34:55 |
| causallm | 4.2.0 | Production-ready causal inference with comprehensive monitoring, testing, and LLM integration | 2025-09-09 17:14:52 |
| omnibench | 0.1.2 | Comprehensive AI agent benchmarking framework | 2025-09-08 22:17:51 |
| clyrdia-cli | 2.0.1 | State-of-the-art AI benchmarking for CI/CD | 2025-09-08 16:25:53 |
| kode-kronical | 0.7.1 | A lightweight Python performance tracking library with automatic data collection and visualization | 2025-09-02 06:11:35 |
| mnt.bench | 0.3.7 | MNT Bench: an MNT tool for benchmarking FCN circuits | 2025-09-01 16:38:38 |
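
Assuming these entries are PyPI-published distributions (the listing format suggests as much), a quick way to see which of them are already present in the current environment is to query the standard-library `importlib.metadata`. The sketch below takes a handful of names from the table; it only reports locally installed versions, which may differ from the release versions listed above.

```python
from importlib.metadata import version, PackageNotFoundError

# A sample of distribution names taken from the table above.
PACKAGES = [
    "tool-scorer",
    "generic-llm-api-client",
    "bbob-jax",
    "novaeval",
    "guidellm",
    "mcpuniverse",
]

for name in PACKAGES:
    try:
        # version() reads installed-package metadata, not the PyPI index.
        print(f"{name}: {version(name)} installed")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```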